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We suggest that the thermodynamic stability parameters (nearest neighbor stacking and hydrogen bond- 
ing free energies) of double-stranded DNA molecules can be inferred reliably from time series of the size 
fluctuations (breathing) of local denaturation zones (bubbles). On the basis of the reconstructed bubble 
size distribution, this is achieved through stochastic optimization of the free energies in terms of Simulated 
Annealing. In particular, it is shown that even noisy time series allow the identification of the stability 
parameters at remarkable accuracy. This method will be useful to obtain the DNA stacking and hydrogen 
bonding free energies from single bubble breathing assays rather than equilibrium data. 

Keywords: DNA breathing, heteropolymer, stochastic optimization, Simulated Annealing, robustness 



I. IIMTRODUCTIOIM 

The Watson-Crick double-helical form of DNAi is not 
a static structure: even at standard salt conditions and 
room temperature the base pairs may intermittently open 
up and expose the otherwise protected core of the nu- 
cleotides. Such local denaturation bubbles are usually 
quite short-lived, however, the propensity of double- 
stranded DNA towards formation of longer-lived bubbles 
can be increased by elevating temperature or lowering the 
salt concentratioujSr^ In naturally underwound circular 
DNA denaturation bubbles are stabilized by partial twist 
releasefiiS, while in modern single DNA molecule setups 
bubble formation may be facilitated by the exertion of 
longitudinal stretching forces<^""— The preferred location 
of bubbles is connected with the stability landscape of 
the genome, as quantified by maps of stability parame- 
ters, which are functions of the specific, underlying se- 
quence of GC and AT base pairs. ^^"^^ In a biological 
context, bubbles correspond to so-called DNA Unwind- 
ing Elements (DUE) , which are central in processes such 
as gene regulation, DNA replication, and transcription.^^ 
Similarly, in higher organisms the thermodynamic stabil- 
ity landscape of DNA is related to the coding versus non- 
coding properties of the genomei ^°i^^ The denaturation 
of a long DNA chain from double-strand to two separate 
single-strands is a physical phase transition, whose order 
is determined by the magnitude of the critical exponent 
c for the entropy loss of a flexible polymer loop, see the 
discussion below i^^— i^^'^^i^^ The opening-closing dynam- 
ics of denaturation bubbles can be quantified by simple 
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nonequilibrium models based on the gradient of the DNA 
stability free energy landscape)^"— 

Melting profiles of DNA can be obtained from a host 
of experimental techniques. These include UV spectro- 
scopic methods,'^- circular dichroism,^ fluorescence reso- 
nant energy transfer measurements,^^ calorimetry,^ or 
nuclear magnetic resonance;^ among others. Single 
DNA manipulation techniques such as unzipping have 
recently been shown to provide high accuracy results 
for the stability parameters and their salt dependences^ 
From the respective melting or unzipping curves the 
DNA stability parameters are deduced, which in bioin- 
formatics serve to predict the melting proflles of arbi- 
trary, given DNA sequences)^ Up until now the differ- 
ent sets of stability parameters differ considerably from 
each other..^-"i8. Alternative methods to measure these 
may help to pin down optimized parameters. One way 
could be to use dynamic information from bubble breath- 
ing. Indeed, by fluorescence correlation spectroscopy the 
breathing dynamics of single DNA bubbles has been mon- 
itored, producing the breathing-induced fluorescence- 
fluorescence correlation function, that is pronouncedly 
non-exponentiali^2iM Given the recent progress in exper- 
imental methods, we expect that time series of single 
bubble dynamics will soon become available, in which 
opening or closing events of individual base pairs can 
be monitored. A high potential for such time records 
lies in nano-channel approaches as the one reported in 
Ref. m, after new labeling techniques will become avail- 
able shortly. 

In what follows we pursue the question whether the 
bubble size distribution obtained from single breathing 
time series may, in principle, be used to obtain reliable 
information on the DNA stability parameters. We show 
that indeed by stochastic analysis methods such as Simu- 
lated Annealing (SA) accurate estimates for the stability 
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parameters may be obtained for known DNA sequences. 

The paper is structured as follows. We first introduce 
the general statistical model of DNA base pairing, be- 
fore proceeding to present the methodology of SA. In the 
subsequent section we present our results, before drawing 
our conclusions. 



II. STATISTICAL MODEL FOR DNA DENATURATION 
A. Thermodynamics 

The size of denaturation bubbles typically ranges from 
a few broken base pairs (bps) at physiological tem- 
perature in linear, unconstrained DNA, to some 200 
broken bps closer to the melting temperature of the 
DNA.— i^i^ii^i^ Bubbles of some hundred broken base 
pairs also occur in naturally underwound DNAi^iiS Fol- 
lowing the notation of Ref. [l4|, the stability of DNA is 
characterized by the free energies e?ib(AT) and ehb(GC) 
for the Watson-Crick hydrogen bonds between comple- 
mentary nucleotides (A and T, G and C, respectively) as 
well as the independent stacking free energies Cst for dis- 
rupting the stacking interactions between nearest neigh- 
bor bps. These stacking energies depend on the nature of 
the two vicinal bps, as well as on their orientation along 
the DNA molecule (3' to 5'). The free energies are func- 
tions of temperature and salt concentration. Depending 
on the used set of stability parameters more or less pro- 
nounced asymmetries in the stacking free energies are 
observed.—"— In addition to the hydrogen bonding and 
stacking free energies, there is an additional energetic 
cost for initiating a bubble in the first place. Roughly 
speaking, this term originates from the fact that two 
stacking contacts need to be broken, while only one single 
broken bp yields an entropic gain. This is either taken 
into consideration by the cooperativity factor ctq, or the 
so-called ring factor ^, see below 

The L33B9 sequence^ we are analyzing in the present 
work is given as follows, 

5' - cCGCCAGCGGCCTTTACTAAAGGCCGCTGCGCc - 3', (1) 

where the double-strand is completed by adding the com- 
plementary single strand. The sequence ([T]) is linear, and 
the high content of more stable GC bps at the two ends 
ensures that these ends preferentially remain closed. A 
denaturation bubble forms in the center of the chain that 
is rich in weaker AT bonds. We therefore view the two 
extremities denoted by the lower case symbol c as com- 
pletely clamped. Labeling the sequence of bps by the 
coordinate x, ranging from x — Q io x = M + 1, we thus 
have M = 31 internal bps, which are allowed to open up, 
while the bps at a; = and x = M + 1 remain closed 
by definition. In a mathematical sense, the bps at the 
two extremities represent reflecting boundary conditions. 
Furthermore, we call xl and Xf> the momentary positions 
of the two closed bps embracing the denaturation bubble 



to the left and right, such that the bubble size becomes 
m — xji — xl — 1. In terms of the Boltzmann factors for 
hydrogen bonding of the bp at position x, 

u/,b(a;) = exp ( -^-^ I , (2) 

and the stacking interactions between the bps at posi- 
tions x — \ and X, 

u^t(a;) =exp ^^^^ , (3) 

the bubble partition function becomes (m 5^ 1): 

^1 XL+m XL+rn+l 

^{xL,m)^— TT Uhb(a;) TT Ust{x). 

1 -I- m)^ 

(4) 

At TO = 0, we take 3f{m = 0) = 1. In Eq. (g]), the factor 
(l-t-m)""^ takes care of the entropy loss upon formation of 
a closed polymer loop. For a self-avoiding chain in three 
dimensions, the critical exponent becomes c = 1.76 
Corrections of c may occur due to interactions with the 
rest of the chain^Sl however, for the short DNA construct 
used here, such effects are not expected to be relevant. 
The ring factor is ^ « 10"^,™ and we define = 2"^^. 
The ring factor may be interpreted as the cooperativity 
parameter, divided by the Boltzmann factor for stack- 
ing, ^ = ao/ exp{est/kBT)M In principle, the ring factor 
depends on the position. However, a bubble will sta- 
tistically always form at the weakest link. Considering 
this we have used a constant value of ring factor, ^ in 
the present work. With above notation, the equilibrium 
distribution for finding a bubble of size to and with the 
leftmost broken bp located at position x + 1, is given by 

p f ^ ^{xL,m) 

^W + 2l^=i2ZxL=o ^[XL,m) 

B. Nonequilibrium: bubble breathing 

Powered by thermal fluctuations, the bubble size be- 
comes a random process as a function of time. Varying 
stepwise by further unzipping of one bp at position or 
Xfl, or by zipping ai xl + ^ and xr — 1, the bubble size m 
performs a random walk along the coordinate x, the bub- 
ble breathing dynamics i^"—i^ This process is described 
by the master equation^! 

dP{xL,m,t) 

^—^ =WP(xl,to,<), (6) 

where P{xl, fn, t) is the probability distribution for find- 
ing a bubble of size m with the leftmost open bp at po- 
sition XL -|- 1, at time t. The matrix W contains the 
transfer rates for all possible transitions in the {xL,m) 
space, for details see Ref. H^. In the long time limit, 
the solution P of the master equation ^ equilibrates to 
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FIG. 1. (Color online) Theoretical probability distribution for 
finding a tagged bp at position xt open (solid line), compared 
with the result from the converged SA scheme (blue open 
squares). The underlying DNA sequence is given in Eq. 



the distribution Pcq of Eq. ([5]). To generate individual 
bubble breathing time series for m(t) and XL{t), as well 
as construct the distribution P^q, one may employ the 
Gillespie algorithm i^^'^^ 

Following the experimental setup in Ref. IH, one may 
study the dynamics of a tagged bp located ai x — xt- 
In the typical experimental scenario fluorescence occurs 
if the bps in a (^-neighborhood of the fluorophore posi- 
tion XT are open. Measured fluorescence time series thus 
correspond to the stochastic variable I{t), with the prop- 
erties I{t) = 1 if at least all bps in {xt — 5,xt + are 
open, and I{t) — otherwise.'^* In what follows we probe 
whether a single bp is open or closed, i.e., we choose 
<5 = 0. 



III. STOCHASTIC OPTIMIZATION 

Given the probability distribution Pcq{'m^XL)^ con- 
structed from an experimental or simulations time series 
m(i), XL{t), for a bubble in the DNA construct under 
consideration: can we reliably extract the stability pa- 
rameters? Here we show that stochastic optimization is 
the method of choice. 

Finding system parameters in a complex landscape is 
a generic task across disciplines 42"J§. Typically, a given 
problem is cast in such a manner that the seeked-for op- 
timum corresponds to an extremum of a functional in 
the complex search space. For instance, to obtain the 
global minimum in a rugged potential energy surface, 
one starts from any arbitrary point on this landscape 
and then moves on in the search space, following cer- 
tain rules, such as accepting a move if the gradient norm 
for the new position decreases. This process converges 
to a point for which the gradient norm is zero. To ver- 



ify whether this point is a minimum, one needs to check 
if the eigenvalues of the Hessian matrix at that point 
are all positive. A completely deterministic optimiza- 
tion procedure such as this minimization of the gradient 
norm, however, will generally fail to determine the global 
minimum if the search space features multiple minima. 
Once a local minimum is found, the deterministic search 
method will simply terminate. Such a misguidance is 
avoided by true global optimizers, whose search is not 
solely driven by a gradient. In particular, stochastic opti- 
mization techniques turn out to be very successful. Orig- 
inally proposed by Kirkpatrick and coworkers to solve 
the traveling salesman problem^^i^ SA represents such 
a true global optimizer, and has been applied to a broad 
range of problems across disciplines, see, for instance, 
Refs. l49l - [57l In SA, the search space is initially sam- 
pled at a high temperature (Tat). The associated ther- 
mal fluctuations at a suitable value of Tat will lift the 
optimizer out of local minima such that the search may 
continue towards increasingly deeper minima. Once the 
temperature becomes sufficiently small and/or the search 
is carried out over a sufficient time span, the entire search 
space is probed. Due to this ergodic property the global 
minimum is indeed found unequivocallyj^ 

Typically, an SA analysis is started at a sufficiently 
high temperature. This makes nearly all moves accept- 
able, as the criterion for accepting or rejecting a move 
is determined by the Metropolis criterion. In our case, 
the associated cost function, which is being minimized, is 
the sum of the squares of the difference of the occupation 
probabilities at the various positions, 

M 

cost, = ^(Peq(x,)-TT.,(a:.))^ (7) 

where Py^t (^) denotes the distribution at position x 
found in the current SA step, when the simulation tem- 
perature is Tat. If, on going from one SA step (i) to the 
next (i + l) the magnitude of the cost function decreases, 
we at once accept that move. If it increases, we do not 
discard the move rightout. Instead, we subject it to the 
Metropolis teslj^: if the quantity A = cost^ — cost^^i has 
a positive value, the probability for accepting the move 
is determined by the function 

F^»p(-A). (8) 

For positive A, P is always between and 1. For each 
evaluation of P, we invoke a random number rand be- 
tween and 1. If P > rand, we accept the move. If 
not, the move is rejected. Thus, at very high Tat, F 
will be close to 1 and most moves will be accepted, such 
that a greater region of the search space will be sam- 
pled. As the simulation proceeds. Tat is decreased by the 
annealing schedule. Once the correct path towards the 
global minimum is followed, we need not search the en- 
tire space and concentrate on a small region, which will 
guide us specifically to the global minimum. That is, as 
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Tat is lowered, a decreasing number of moves pass the 
Metropolis test. Ultimately, in our problem we recover 
the stability parameters from the SA analysis. 

In SA, the crucial factor which determines the success 
of optimization is the annealing schedule, which is ba- 
sically the rate at which the simulation temperature is 
decreased in successive annealing steps. In the present 
study we have kept the initial temperature at 1000. The 
rate of cooling was kept at 10% of the value of the 
present step. We have also ensured that after every 30 SA 
steps, the system is re-heated to the initial starting value, 
i.e., the simulation temperature is forcibly increased to a 
higher value. This is done to remove any possibility of 
being trapped in a local minimum (coming out of which 
will be difficult if the simulation temperature is low). In 
successive SA steps, along with the temperature, the indi- 
vidual stability parameters are changed by the following 
strategy. If m is a parameter chosen for change in SA, 
it is updated by the rule: u = u + u x (—1)" x 5 x r„, 
where n is a random integer, S is the amplitude of al- 
lowed change (kept at 0.01), and r„ is a random number 
between and 1. The new u' (changed stability parame- 
ter) is used to generate the updated distribution profile. 
The magnitudes of the different optimization parameters 
are collected in Table HI 
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FIG. 2. (Color online) Evolution of the free energy parameters 
of hydrogen bonding and base stacking as function of SA steps 
(full lines) from three separate SA runs. The black dashed 
horizontal lines represent the expected experimental values 
taken from Ref. UA . towards which convergence is expected 
to occur. Note the different scales on the vertical axes. The 
values for two pairs of bps, AT- AT and GA-TC, do not change 
in the SA procedure; these two pairs do not occur in the 
underlying sequence ^ and are thus not subject to the SA 
optimization criteria, i.e., they do not converge. 



TABLE I. Magnitude of optimization parameters used in SA. 



Parameter 


Magnitude 


Annealing Schedule 


10% 


Initial Simulation Temperature 


1000 


Magnitude of Change 5 


0.01 



IV. RESULTS AND DISCUSSION 

In a first step, the equilibrium distribution for a tagged 
bp at location xt in the DNA sequence H]) was de- 
termined from the theoretical stability parameters from 
Ref. [13. SA was then employed for successive conver- 
gence of Ptj^i to this theoretical distribution through vari- 
ation of the 12 independent free energy parameters (com- 
pare Table nil, by minimizing the cost function. The SA 
analysis was terminated once the value of the cost func- 
tion becomes smaller than 10"''. Fig. 1 shows the quite 
accurate convergence of the SA scheme in terms of the 
equilibrium distribution. 

To visualize the progress of the SA procedure, we dis- 
play in Fig. [5] the progress of the approximation of the 
twelve DNA stability parameters of hydrogen bonding 
and base stacking (compare also Table |n| for 8000 SA 
steps, for three separate SA runs starting with different 
initial simulation temperatures. For each simulation the 
initial free energy values are chosen via random pertur- 
bation of the experimental u values;^ following our SA 
strategy. In all cases the convergence is quite accurate. 
Two parameters do not change during the SA scheme, 
these correspond to the two pairs of bps, that do not oc- 
cur in the employed sequence ([ij. To be sure that the 
search proceeds without being held up in local basins, the 
annealing temperature was raised after every 30 SA steps 
and then allowed to follow the usual annealing schedule. 
The sudden jumps in the profile are a result of this effort. 
At an abruptly elevated temperature, newer moves start 
to get accepted and hence the zigzag pattern. 

In terms of the free energy values for hydrogen bond- 
ing and base stacking, the average results from 1000 SA 
runs are shown in Table [III We also indicate which com- 
binations of nearest neighbor pairs actually occur in the 
underlying sequence ([T]) . The convergence of the S A al- 
gorithm in all cases is quite remarkable. In addition to 
the free energy parameter we also optimized the loop ex- 
ponent c and the ring factor ^. The resultant simulation 
profiles (Fig. 3) show a good convergence towards theo- 
retical values. 

In typical experimental data the distribution of the 
bubble opening probability will be noisy, due to finite 
sampling and measurement errors. To check if our SA 
algorithm is robust against such noise we randomly per- 
turbed the theoretically expected equilibrium distribu- 
tion by a gaussian random processes with amplitude and 
width being the Peq and 10% of Peg, respectively. Fig. 4 
shows how this noisy data was quickly smoothened out to 
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TABLE II. Comparison of experimental—" and simulated free 
energy data. Each simulation data is a mean of 1000 different 
SA outputs. The rightmost column shows the presence {^) or 
absence ( x ) of particular free energies in sequence lU . Units 
of free energies {est and e^b) reported here are kcal/mol. The 
last two rows of the table gives a comparison of the ring factor 
^ and critical exponent c. 

Experimental SA results 

est (AT- AT) -1.729409 -1.767474 x 

est (TA-TA) -0.579800 -0.588968 ^ 

est (AA-TT) -1.499484 -1.510239 ^ 

est (GA-TC) -1.819371 -1.798201 x 

est (CA-TG) -0.939677 -0.922743 ^ 

est (AG-CT) -1.455363 -1.462615 ^/ 

est (AC-GT) -2.199241 -2.175124 ^ 

est (GG-CC) -1.829370 -1.801741 ^ 

est (CG-CG) -1.299554 -1.318516 ^ 

est (GC-GC) -2.559130 -2.549840 ^ 

ehh (AT) 0.649775 0.651781 ^ 

ehb (GC) 0.129955 0.113848 ^ 

^ 0.001 0.001034062 

c 1.76 1.758298 
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FIG. 4. (color online) Plot of PT^ti^r) against xt at various 
SA steps. In each panel the red solid lines represents the 
original noisy data, and the blue dashed line is the output 
of SA runs. The blue open squares stand for the theoretical 
distribution Pcqixx)- The plot for 1,500 SA steps already 
matches quite well the expected distribution Peq{xT)- 
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V. CONCLUSION 



reach the theoretical distribution profile. We show snap- 
shots of the process for different SA steps. In each figure, 
the original noisy data, the equilibrium distribution pro- 
file and the evolving profile at the particular SA step 
are shown. At 1500 SA steps, the noisy data completely 
matches with the equilibrium distribution. 
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FIG. 3. (Color online) Evolution of ring factor ^ and critical 
exponent c from three different SA runs. The black dashed 
horizontal lines represent the expected literature value to- 
wards which convergence is expected to occur. 



Generalising our previous approach;^ we here demon- 
strate the outstanding ability of stochastic optimization 
to determine the stability parameters of double-stranded 
DNA from time series of the breathing dynamics of in- 
dividual bps. Even for a short DNA sequence such as 
L33B9 [Eq. (P)] with only 31 internal bps, the conver- 
gence of the chosen SA scheme to all present base stack- 
ing and hydrogen bonding free energies is recovered with 
appreciable accuracy. Even when the input data are per- 
turbed randomly, mimicking noisy experimental or simu- 
lations data, the stochastic optimization technique works 
successfully. 

Optimization based on the bubble distribution Pcq{x) 
is not the only way to extract the DNA stability param- 
eters. For instance, one might use average values for the 
zipping and unzipping rates of individual bps and relate 
their ratio to the underlying free energy difference. Alter- 
natively, once from high throughput fluorescence correla- 
tion experiments an accurate result for the fluorescence 
autocorrelation function becomes available, one might 
use this function as basis for the optimization. In princi- 
ple, one might also modify our approach to analyse data 
from DNA unzipping. This, however, requires detailed 
knowledge on the change of the stacking and hydrogen 
free energies upon stretching of the DNA strands. 

In general, it may be worthwhile to also explore the 
possibility to apply other techniques such as the ge- 
netic algorithm, ®° parallel tempering,— or ant colony 
optimizatiouf^i^ and to compare these methods. 
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